── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract() masks magrittr::extract()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ purrr::set_names() masks magrittr::set_names()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(data.table)
Attaching package: 'data.table'
The following objects are masked from 'package:lubridate':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
The following objects are masked from 'package:dplyr':
between, first, last
The following object is masked from 'package:purrr':
transpose
Download, read, and get familiar with an external dataset.
Step through the EDA “checklist” presented in class
Practice making exploratory plots
Assignment Description
We will work with air pollution data from the U.S. Environmental Protection Agency (EPA). The EPA has a national monitoring network of air pollution sites. The primary question you will answer is whether daily concentrations of PM2.5 (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).
A primer on particulate matter air pollution can be found here.
Your assignment should be completed in Quarto or R Markdown.
Steps
Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table. For each of the two datasets, check the dimensions, headers, footers, variable names, and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.
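The checks listed in this step could be sketched as follows. This is a minimal illustration rather than the code used for the assignment: so that it runs on its own, it writes a tiny made-up CSV to a temporary file. The file contents and the object name `pm` are hypothetical stand-ins for a real EPA download.

```r
library(data.table)

# Hypothetical stand-in for one EPA daily PM2.5 download (values are made up)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Site ID,Daily Mean PM2.5 Concentration,Site Name",
  "01/01/2022,60010007,12.7,Livermore",
  "01/02/2022,60010007,-0.4,Livermore",
  "01/03/2022,60010007,8.1,Livermore"
), csv_path)

pm <- fread(csv_path)

dim(pm)            # dimensions: rows, columns
head(pm)           # header (first rows)
tail(pm)           # footer (last rows)
names(pm)          # variable names
sapply(pm, class)  # variable types

# Closer look at the key variable, including its minimum and maximum
summary(pm$`Daily Mean PM2.5 Concentration`)
```

With the real files, the same calls would be repeated for the 2002 and the 2022 data.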
2002: There are 15976 observations of 20 variables. The first date in the data set is 01/05/2002 from Livermore. The last date in the data set is 12/31/2002 from Woodland-Gibson Road. The range of daily mean PM 2.5 concentration is 104.30 with a mean of 16.12.
2022: There are 56140 observations for 20 variables. The first date in the data set is 01/01/2022 from Livermore. The last date in the data set is 12/31/2022 from Woodland-Gibson Road. The range of daily mean PM 2.5 concentration is 304.7 with a mean of 8.52.
Looking at the key variable (PM 2.5 concentration), the minimum is -2.20 but it does not make sense to have negative values for concentration (the minimum should be 0).
summary(data_2022$`Daily Mean PM2.5 Concentration`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -2.20    4.20    6.90    8.52   10.80  302.50
Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
merged_data <- rbindlist(list(data_2002[, year := 2002], data_2022[, year := 2022]))
merged_data$PM2.5 <- merged_data$`Daily Mean PM2.5 Concentration`
merged_data$`Daily Mean PM2.5 Concentration` <- NULL
merged_data$lat <- merged_data$SITE_LATITUDE
merged_data$SITE_LATITUDE <- NULL
merged_data$lon <- merged_data$SITE_LONGITUDE
merged_data$SITE_LONGITUDE <- NULL
str(merged_data)
Classes 'data.table' and 'data.frame': 72116 obs. of 21 variables:
$ Date : chr "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
$ Source : chr "AQS" "AQS" "AQS" "AQS" ...
$ Site ID : int 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
$ POC : int 1 1 1 1 1 1 1 1 1 1 ...
$ UNITS : chr "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
$ DAILY_AQI_VALUE : int 78 92 71 80 98 115 87 57 65 107 ...
$ Site Name : chr "Livermore" "Livermore" "Livermore" "Livermore" ...
$ DAILY_OBS_COUNT : int 1 1 1 1 1 1 1 1 1 1 ...
$ PERCENT_COMPLETE : num 100 100 100 100 100 100 100 100 100 100 ...
$ AQS_PARAMETER_CODE: int 88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
$ AQS_PARAMETER_DESC: chr "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
$ CBSA_CODE : int 41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
$ CBSA_NAME : chr "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
$ STATE_CODE : int 6 6 6 6 6 6 6 6 6 6 ...
$ STATE : chr "California" "California" "California" "California" ...
$ COUNTY_CODE : int 1 1 1 1 1 1 1 1 1 1 ...
$ COUNTY : chr "Alameda" "Alameda" "Alameda" "Alameda" ...
$ year : num 2002 2002 2002 2002 2002 ...
$ PM2.5 : num 25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
$ lat : num 37.7 37.7 37.7 37.7 37.7 ...
$ lon : num -122 -122 -122 -122 -122 ...
- attr(*, ".internal.selfref")=<externalptr>
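The step asks for the year to be created from the Date variable. One possible way to do that, together with renaming in place, is sketched below. The two small tables are made-up rows that only mimic the EPA column names. `data.table::year()` is called with its namespace because data.table and lubridate both export a `year()` function.

```r
library(data.table)

# Made-up rows mimicking the column names of the EPA files
data_2002 <- data.table(
  Date = c("01/05/2002", "01/06/2002"),
  `Daily Mean PM2.5 Concentration` = c(25.1, 31.6),
  SITE_LATITUDE = c(37.7, 37.7),
  SITE_LONGITUDE = c(-121.8, -121.8)
)
data_2022 <- data.table(
  Date = c("01/01/2022", "01/02/2022"),
  `Daily Mean PM2.5 Concentration` = c(12.7, -0.4),
  SITE_LATITUDE = c(37.7, 37.7),
  SITE_LONGITUDE = c(-121.8, -121.8)
)

merged <- rbindlist(list(data_2002, data_2022))

# Parse the month/day/year strings, then take the year from the parsed date
merged[, Date := as.Date(Date, format = "%m/%d/%Y")]
merged[, year := data.table::year(Date)]

# Rename in place instead of copying each column and deleting the original
setnames(
  merged,
  old = c("Daily Mean PM2.5 Concentration", "SITE_LATITUDE", "SITE_LONGITUDE"),
  new = c("PM2.5", "lat", "lon")
)

str(merged)
```

Parsing Date into a real date class also makes later checks by month or season possible, which a character column does not allow directly.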
Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.
The monitoring sites are distributed throughout the entire state of California, with the highest density of sites in the San Francisco area and the Los Angeles / San Diego area. In addition, there is a much higher number of sites in 2022 as compared to 2002.
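A map of this kind could be built roughly as follows. This is a sketch under assumptions, not the map code from the report: the `sites` table holds made-up coordinates, the object names are hypothetical, and the colors are arbitrary. It assumes the leaflet package is installed.

```r
library(leaflet)

# Made-up site coordinates; the real input would be the unique sites per year
sites <- data.frame(
  year = c("2002", "2022", "2022"),
  lat  = c(37.69, 34.07, 38.57),
  lon  = c(-121.78, -118.23, -121.49)
)

# One color per year (the color choices are arbitrary)
year_pal <- colorFactor(c("blue", "red"), domain = sites$year)

site_map <- leaflet(sites) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lon, lat = ~lat,
    color = ~year_pal(year),
    radius = 4, opacity = 1, fillOpacity = 0.7
  ) %>%
  addLegend("bottomleft", pal = year_pal, values = ~year, title = "Year")

site_map
```

Reducing the data to one row per site and year before mapping keeps the map light, since each site otherwise appears once per daily observation.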
Check for any missing or implausible values of PM2.5 in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.
There are 0 missing values and 207 implausible values of PM 2.5 in the combined dataset.
The implausible values are all negative (-2.2 to -0.1), which, as previously mentioned, does not make sense for a concentration variable. All of the implausible values are from the year 2022 and do not have a date associated with them.
     Date              Source             Site ID              POC
 Length:207         Length:207         Min.   :60010011   Min.   :1.00
 Class :character   Class :character   1st Qu.:60292009   1st Qu.:3.00
 Mode  :character   Mode  :character   Median :60651016   Median :3.00
                                       Mean   :60616431   Mean   :2.56
                                       3rd Qu.:60832004   3rd Qu.:3.00
                                       Max.   :61130004   Max.   :4.00
    UNITS           DAILY_AQI_VALUE     Site Name          DAILY_OBS_COUNT
 Length:207         Min.   :0          Length:207         Min.   :1
 Class :character   1st Qu.:0          Class :character   1st Qu.:1
 Mode  :character   Median :0          Mode  :character   Median :1
                    Mean   :0                             Mean   :1
                    3rd Qu.:0                             3rd Qu.:1
                    Max.   :0                             Max.   :1
 PERCENT_COMPLETE   AQS_PARAMETER_CODE AQS_PARAMETER_DESC CBSA_CODE
 Min.   :100        Min.   :88101      Length:207         Min.   :12540
 1st Qu.:100        1st Qu.:88101      Class :character   1st Qu.:33045
 Median :100        Median :88101      Mode  :character   Median :40900
 Mean   :100        Mean   :88239                         Mean   :35740
 3rd Qu.:100        3rd Qu.:88502                         3rd Qu.:42020
 Max.   :100        Max.   :88502                         Max.   :47300
                                                          NA's   :19
  CBSA_NAME           STATE_CODE         STATE             COUNTY_CODE
 Length:207         Min.   :6          Length:207         Min.   :  1.0
 Class :character   1st Qu.:6          Class :character   1st Qu.: 29.0
 Mode  :character   Median :6          Mode  :character   Median : 65.0
                    Mean   :6                             Mean   : 61.5
                    3rd Qu.:6                             3rd Qu.: 83.0
                    Max.   :6                             Max.   :113.0
    COUNTY               year              PM2.5               lat
 Length:207         Min.   :2022       Min.   :-2.2000    Min.   :32.84
 Class :character   1st Qu.:2022       1st Qu.:-0.7500    1st Qu.:34.84
 Mode  :character   Median :2022       Median :-0.4000    Median :37.06
                    Mean   :2022       Mean   :-0.5324    Mean   :36.95
                    3rd Qu.:2022       3rd Qu.:-0.2000    3rd Qu.:38.61
                    Max.   :2022       Max.   :-0.1000    Max.   :41.76
      lon
 Min.   :-124.2
 1st Qu.:-122.1
 Median :-121.2
 Mean   :-120.5
 3rd Qu.:-118.9
 Max.   :-115.5
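The counts and proportions of missing and negative values, and their distribution over time, could be computed along these lines. This is a hedged sketch on made-up readings, and the object names `pm`, `quality`, and `negatives_by_month` are hypothetical. It assumes Date has already been parsed into a Date class. That matters because `summary()` on a character column reports only its length, class, and mode, so it cannot show when the flagged readings occurred.

```r
library(data.table)

# Made-up readings; column names follow the renamed variables used in this report
pm <- data.table(
  Date  = as.Date(c("2002-01-05", "2022-01-10", "2022-01-11",
                    "2022-07-04", "2022-07-05", "2022-12-31")),
  year  = c(2002, 2022, 2022, 2022, 2022, 2022),
  PM2.5 = c(25.1, -0.4, 6.9, -1.2, NA, 10.8)
)

# Counts and proportions of missing and negative readings, per year
quality <- pm[, .(
  n             = .N,
  n_missing     = sum(is.na(PM2.5)),
  prop_missing  = mean(is.na(PM2.5)),
  n_negative    = sum(PM2.5 < 0, na.rm = TRUE),
  prop_negative = mean(PM2.5 < 0, na.rm = TRUE)
), by = year]
quality

# Temporal pattern: number of negative readings in each month
negatives_by_month <- pm[
  PM2.5 < 0,
  .N,
  by = .(year, month = data.table::month(Date))
]
negatives_by_month
```

With the real EPA files, Date is stored as a month/day/year string and would first need `as.Date(Date, format = "%m/%d/%Y")`.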
Explore the main question of interest at three different spatial levels. Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.
State: At the state level, the average PM 2.5 concentration was higher in 2002 (16.11) than in 2022 (8.52). However, both the summary statistics and the boxplot show that the range in 2022 (-2.2 to 302.5) was much larger than in 2002 (0 to 104.3).
# Exploratory Plot
ggplot(merged_data, aes(x = as.factor(year), y = PM2.5)) +
  geom_boxplot() +
  labs(title = "Average PM 2.5 Concentration at State Level",
       x = "Year",
       y = "Average PM 2.5 Concentration")
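The state-level summary statistics quoted above could be obtained with a grouped summary such as the one below. The readings are made up and the object names are hypothetical. The overlaid histogram is one optional complement to the boxplot, not a plot from the original report.

```r
library(data.table)
library(ggplot2)

# Made-up readings standing in for the merged data
pm <- data.table(
  year  = c(2002, 2002, 2002, 2022, 2022, 2022),
  PM2.5 = c(25.1, 10.0, 13.2, 6.9, -0.4, 19.0)
)

# Mean, median, and extremes of PM2.5 for each year
state_stats <- pm[, .(
  mean_pm   = mean(PM2.5, na.rm = TRUE),
  median_pm = median(PM2.5, na.rm = TRUE),
  min_pm    = min(PM2.5, na.rm = TRUE),
  max_pm    = max(PM2.5, na.rm = TRUE)
), by = year]
state_stats

# Overlaid histograms show the shape of each year's distribution
state_hist <- ggplot(pm, aes(x = PM2.5, fill = as.factor(year))) +
  geom_histogram(bins = 30, alpha = 0.5, position = "identity") +
  labs(x = "Daily mean PM 2.5 concentration", y = "Count", fill = "Year")
state_hist
```

Reporting the median alongside the mean is useful here because a few very high readings can pull the mean upward.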
County: At the county level, most of the 51 counties had higher average PM 2.5 concentrations in 2002 as compared to 2022. There were 5 counties that were exceptions to that trend: Del Norte (3.81 in 2002 and 4.96 in 2022), Mendocino (8.84 in 2002 and 10.13 in 2022), Mono (2.68 in 2002 and 4.69 in 2022), Siskiyou (2.69 in 2002 and 7.59 in 2022), and Trinity (2.78 in 2002 and 10.72 in 2022). The line plot shows that these counties with an upward trend from 2002 to 2022 had some of the lowest starting values in 2002.
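One way to find the counties that run against the overall trend is to compute county means per year, reshape to one row per county, and take the difference. The sketch below uses made-up values for two counties, and the object names are hypothetical.

```r
library(data.table)

# Made-up readings for two counties
pm <- data.table(
  COUNTY = rep(c("Alameda", "Trinity"), each = 4),
  year   = rep(c(2002, 2002, 2022, 2022), times = 2),
  PM2.5  = c(20, 16, 9, 7, 2, 4, 10, 12)
)

# Mean PM2.5 per county and year
county_means <- pm[, .(mean_pm = mean(PM2.5, na.rm = TRUE)), by = .(COUNTY, year)]

# One row per county, one column per year, plus the change between years
county_wide <- dcast(county_means, COUNTY ~ year, value.var = "mean_pm")
setnames(county_wide, c("2002", "2022"), c("pm_2002", "pm_2022"))
county_wide[, change := pm_2022 - pm_2002]

# Counties whose average increased from 2002 to 2022
increased <- county_wide[change > 0, COUNTY]
increased
```

A county monitored in only one of the two years gets a missing `change` and is dropped by the `change > 0` filter, so such counties would need to be reported separately.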